⚙️ Minimal Retrieval-Augmented Generation (RAG) in Python
By Mohammed KAIDI
🔍 Definition
RAG (Retrieval-Augmented Generation) is an architecture that combines:
- Retrieval – Fetching relevant documents from an external knowledge source (vector database, etc.).
- Generation – Using a Large Language Model (LLM) to generate answers based on the retrieved context.
💬 Instead of inventing an answer, the AI searches for accurate content and responds using it.
🧩 Key Components
| Component | Description |
|---|---|
| Vector Store | Stores vector embeddings of your documents. Examples: FAISS, Pinecone, Qdrant |
| Embeddings | Numerical representations of texts, e.g. using `text-embedding-3-small` |
| LLM | Large Language Model like GPT-4, Claude, Mistral |
| Orchestrator | Optional. Helps coordinate retrieval + generation. Examples: LangChain, LlamaIndex |
🧰 Use Cases
1. 🤖 Chatbot with Private Docs
A chatbot that answers using your internal PDFs, Notion pages, or company wiki.
2. 🧾 Smart Customer Support
Connect RAG to support tickets + product docs → auto-reply bot mimics a real agent.
3. 📚 Legal or Scientific Knowledge Search
Query huge legal texts or medical research databases using natural language.
4. 🧠 Memory-Augmented Assistant
A personal assistant that remembers and queries from indexed personal notes/emails.
5. 🏢 Enterprise Semantic Search
Ask any company-related question → get the best matching doc snippet as an answer.
⚖️ RAG vs Pure LLM
| Pure LLM | RAG |
|---|---|
| Relies on pre-trained knowledge only | Can query up-to-date external sources |
| May hallucinate answers | Provides grounded, verifiable responses |
| Updating requires fine-tuning | Just add/update documents in your index |
This guide explains how to build a minimal RAG (Retrieval-Augmented Generation) system using Python, FAISS, and OpenAI. It performs:
- Document ingestion
- Embedding generation (via OpenAI)
- Vector search (via FAISS)
- Answer generation (via GPT-4)
🧱 Stack
- Python
- OpenAI API (for embeddings + generation)
- FAISS (for vector similarity search)
- tiktoken (optional, for token counting)
📦 Installation
Use a virtual environment:
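For example (package names follow the stack above; `faiss-cpu` is the CPU-only build of FAISS):

```shell
# Create and activate a virtual environment, then install the dependencies.
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install openai faiss-cpu numpy python-dotenv tiktoken
```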
🔐 Setup OpenAI Key
Create a `.env` file at the root of your project:
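For example (the key value is a placeholder for your own API key):

```
OPENAI_API_KEY=sk-your-key-here
```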
Install `python-dotenv`:
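With pip:

```shell
pip install python-dotenv
```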
📄 Python Code (rag.py)
✅ Expected Output
🚀 Next Steps
- Ingest PDFs with PyMuPDF or pdfplumber
- Replace FAISS with Qdrant or Pinecone
- Build a frontend using Flask, Next.js or Node.js
- Optimize chunking with tiktoken for long documents
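The last step can be sketched with token-based chunking; the helper names below are assumptions, and `tiktoken` is imported lazily since it is an optional dependency in this stack:

```python
def chunk_tokens(tokens, max_tokens=200, overlap=20):
    """Split a token list into overlapping windows of at most max_tokens."""
    step = max_tokens - overlap
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

def chunk_text(text, max_tokens=200, overlap=20):
    """Chunk text by token count rather than characters (requires tiktoken)."""
    import tiktoken  # imported here because tiktoken is optional
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(c) for c in chunk_tokens(tokens, max_tokens, overlap)]
```

Counting tokens (instead of characters) keeps each chunk safely under the embedding model's context limit, and the overlap preserves context across chunk boundaries.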
🧠 What is RAG?
RAG = Retrieval-Augmented Generation
- Retrieval: Fetch relevant chunks from your documents
- Generation: Use LLM (like GPT-4) to generate an answer based on those chunks
This avoids hallucinations and gives grounded, controllable answers.
❓ Does ChatGPT Store My Data?
- ChatGPT (web app or mobile app):
  - By default, your chats may be used to improve models unless you turn off history.
  - Go to ChatGPT Settings → Disable Chat History & Training.
  - When disabled: ✅ Conversations are not stored or used to train models.
⚙️ What About the OpenAI API?
- Using the OpenAI API (e.g. via `openai.ChatCompletion.create()`):
  - Your data is not stored.
  - Your data is not used to train or improve models.
  - ✅ API usage is isolated per request and discarded after processing.
Source: OpenAI API Data Usage Policy
🏢 For Enterprise Use (Sensitive Data)
Use one of these options:
| Option | Description |
|---|---|
| OpenAI Enterprise | Full privacy, zero retention, enterprise-grade security (SOC 2, ISO 27001, etc.) |
| Azure OpenAI | Hosted by Microsoft, strict data residency and compliance options |
| Self-hosted LLM | Deploy models like Mistral, LLaMA 2, Mixtral locally or on a private cloud |
🧠 Summary
- ✅ API usage is private and safe for enterprise
- 🛑 Avoid sending sensitive data via the ChatGPT app unless chat history is off
- 🧱 For total control: go with OpenAI Enterprise, Azure, or on-premise LLM
✅ Best Practices for Building RAG or LLM Apps
- Use the OpenAI API from your server, not a client-side SDK, for sensitive inputs.
- Do all vectorization and generation on the server.
- Optionally anonymize, redact, or encrypt sensitive parts of user input before sending.
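The last point can be sketched with a simple regex-based redactor; the patterns and placeholder names below are illustrative assumptions, not a complete PII solution:

```python
# Mask email addresses and phone-like numbers before sending user text
# to an external API. Real deployments would use a proper PII detector.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text):
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Because retrieval happens server-side, redaction can be applied once at the API boundary before any text reaches the embedding or chat endpoints.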